bias
variance
* In all three cases, the variance increases and the bias decreases as the method’s flexibility increases.
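This trade-off is the bias–variance decomposition of the expected test MSE at a point \(x_0\) (with \(\epsilon\) the irreducible error):

\[E\left(y_0-\hat{f}(x_0)\right)^2=Var(\hat{f}(x_0))+\left[Bias(\hat{f}(x_0))\right]^2+Var(\epsilon)\]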
prediction
Inference
flexible
* To make a multiple linear regression model more flexible: add predictors, add interaction terms, or add nonlinear functions of existing predictors. - more flexible = higher variance * non-restrictive models
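As an illustrative sketch (not from the original notes), the three ways of adding flexibility to lm() on the Auto data:

```r
library(ISLR)
fit_base  <- lm(mpg ~ horsepower + weight, data = Auto)          # baseline
fit_inter <- lm(mpg ~ horsepower * weight, data = Auto)          # + interaction term
fit_poly  <- lm(mpg ~ poly(horsepower, 2) + weight, data = Auto) # + nonlinear term
```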
library(ISLR)
library(fma)
lm.fit = lm(mpg~.-name, data=Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm.fit)
* A one-unit increase in balance is associated with an increase in the log odds of default (y) by 0.0055 units.
* prediction - the default probability for an individual with a balance of $1,000 is about 0.005, which is less than 1% * Simple logistic regression
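With the coefficient estimates from ISLR (\(\hat\beta_0=-10.6513\), \(\hat\beta_1=0.0055\)), the prediction comes from plugging balance into the logistic function:

\[\hat{p}(1000)=\frac{e^{-10.6513+0.0055\times1000}}{1+e^{-10.6513+0.0055\times1000}}\approx0.00576\]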
* Multiple logistic regression
- A student with a credit card balance of $1,500 and an income of $40,000 has an estimated default probability of 0.058. - The variables student and balance are correlated.
attach(Weekly)
fit.glm = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume , data = Weekly, family = binomial)
summary(fit.glm)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
resp = predict(fit.glm, type = "response")
pred.glm = rep("Down", length(resp))
pred.glm[resp > 0.5] = "Up"
table(pred.glm, Direction)
## Direction
## pred.glm Down Up
## Down 54 48
## Up 430 557
Neural Network * more flexible
* Activation function (common) - Sigmoid: \(\frac1{1+e^{-x}}\) - linear: \(y=ax+b\) - hyperbolic tangent: \(\tanh(x)\) - ReLU: \(\max(0,x)\)
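A quick sketch (not in the original notes) of these activations as R functions:

```r
sigmoid <- function(x) 1 / (1 + exp(-x))       # output in (0, 1)
linear  <- function(x, a = 1, b = 0) a * x + b
relu    <- function(x) pmax(0, x)              # elementwise max(0, x); tanh() is built in
curve(sigmoid, -4, 4, ylim = c(-1, 2))
curve(tanh, -4, 4, add = TRUE, col = "red")
curve(relu, -4, 4, add = TRUE, col = "blue")
```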
library('nnet')
# nnet single layer
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- 5 + 2*x1 + -3*x2 + rnorm(100)
lm.fit = lm(y~x1+x2) # fit linear model
summary(lm.fit)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.98254 -0.55641 0.02562 0.41241 2.55024
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.75735 0.09497 50.09 <2e-16 ***
## x1 1.90813 0.08962 21.29 <2e-16 ***
## x2 -3.10182 0.08663 -35.80 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.936 on 97 degrees of freedom
## Multiple R-squared: 0.9503, Adjusted R-squared: 0.9493
## F-statistic: 927.4 on 2 and 97 DF, p-value: < 2.2e-16
net.fit = nnet(y ~x1+x2,skip=TRUE,size=0,linout = TRUE)
## # weights: 3
## initial value 4797.231735
## final value 84.984215
## converged
# skip = TRUE: no hidden layer
# linout = TRUE: linear output
# linout = FALSE: logistic output
# trace = FALSE: don't display the convergence steps (default TRUE)
summary(net.fit)
## a 2-0-1 network with 3 weights
## options were - skip-layer connections linear output units
## b->o i1->o i2->o
## 4.76 1.91 -3.10
net.fit2 = nnet(y ~ x1 + x2, size = 2, decay = .001, maxit = 20, linout = T, trace = F)
# size = 2: two neurons in the hidden layer
summary(net.fit2)
## a 2-2-1 network with 9 weights
## options were - linear output units decay=0.001
## b->h1 i1->h1 i2->h1
## 5.20 3.44 -4.25
## b->h2 i1->h2 i2->h2
## -2.64 1.78 -3.10
## b->o h1->o h2->o
## -1.79 5.81 7.09
Example:
* Want to compare linear vs higher-order polynomial terms in a linear regression * We randomly split the 392 observations into two sets, a training set containing 196 of the data points, and a validation set containing the remaining 196 observations. * left-single split & right multiple splits
Each LOOCV training set is almost the same size as the full data, so LOOCV has low bias but high variance.
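For least-squares fits the n refits are unnecessary: LOOCV has a closed form using the leverages \(h_i\) from a single fit:

\[CV_{(n)}=\frac1n\Sigma^n_{i=1}\left(\frac{y_i-\hat{y}_i}{1-h_i}\right)^2\]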
# Validation Set Approach
library(ISLR)
set.seed(101)
attach(Auto )
library(stats)
# Method 1
train=sample(392,196)
lm.fit=lm(mpg~horsepower , data=Auto ,subset = train )
MSE_V = mean((mpg - predict(lm.fit, Auto))[-train]^2)  # square the errors, then average
## [-train] selects only the observations that are not in the training set.
MSE_V
# Method 2
# sample_size = floor(0.75*nrow(SSSSS)) #937
# train_index= sample(seq_len(nrow(SSSSS)),size = sample_size)
# train = SSSSS[train_index,]
# test = SSSSS[-train_index,]
# Ctrl+Shift+C toggles commenting in RStudio
#
# ##LOOCV - Leave one out cross validation
# library ( boot)
# glm.fit =glm (mpg~horsepower ,data =Auto)
# cv.err =cv.glm (Auto,glm.fit )
# cv.err$delta #cross-validation results.
#
# # cv for different term
# cv.error =rep (0,10)
# for (i in 1:10) {
# glm.fit = glm (mpg~poly(horsepower ,i),data= Auto)
# cv.error[i]=cv.glm(Auto ,glm.fit )$delta[1]}
# cv.error
# plot(1:10,cv.error,type = 'o', main = 'LOOCV')
# # k-Fold Cross-Validation
# library(boot)
# cv.error.10= rep (0 ,10)
# for (i in 1:10) {
# glm.fit =glm (mpg~poly(horsepower ,i),data= Auto)
# cv.error.10[i]=cv.glm(Auto ,glm.fit ,K=10) $delta [1]
# }
# # cv.glm(...., K=10) 10- fold
# cv.error.10
# plot(1:10,cv.error,type = 'o', main = 'cv with different poly term',col='red')
# lines(1:10,cv.error.10,type = 'o')
set.seed(101)
x = rnorm(100)
noise = rnorm(100)
#generate y
y = 5 + 2*x^7+ noise # beta 7 = 2
df_bestsub = data.frame(y=y,x=x)
library(leaps)
reg_sub = regsubsets(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = df_bestsub, nvmax = 10)
# method="backward" OR method="forward"
# for best subset, leave method unspecified (the default)
reg_summary = summary(reg_sub)
## select by cp, bic, r^2
par(mfrow=c(2,2))
# plot Cp
plot(reg_summary$cp,type='l')
points(which.min(reg_summary$cp),min(reg_summary$cp),col="red") # points(x,y)
# plot BIC
plot(reg_summary$bic,type='l')
points(which.min(reg_summary$bic),min(reg_summary$bic),col="red")
# plot adj R^2
plot(reg_summary$adjr2,type='l')
points(which.max(reg_summary$adjr2),max(reg_summary$adjr2),col="red")
# All three plots reach their optimum at index = 2: by Cp, BIC, and adjusted R^2, the 2-variable model performs best
coefficients(reg_sub,2)
## (Intercept) I(x^2) I(x^7)
## 4.7808613 0.2626095 2.0069657
# The best subset approach selects 2 predictors: x^2 and x^7
# select subset size by prediction MSE
library(glmnet)
library(ISLR)
mat = model.matrix(Salary ~ ., data = Hitters)  # rows with missing Salary are dropped
y_hitters = na.omit(Hitters)$Salary             # response aligned with mat's rows
reg_sub = regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
reg_summary = summary(reg_sub)
mse = rep(NA, 19)
for (i in 1:19){
  coefi = coef(reg_sub, id = i)
  pred = mat[, names(coefi)] %*% coefi
  mse[i] = mean((y_hitters - pred)^2)  # must use Hitters' Salary, not the simulated y above
}
plot(1:19,mse,type='o',col="darkblue",main="Variable selection",xlab='Number of predictors',ylab='MSE')
grid(col = "darkgrey")
points(which.min(mse),mse[which.min(mse)],col = "red",cex = 2,pch=2)
text(which.min(mse)+1,mse[which.min(mse)],round(which.min(mse),2),col = "red")
Cost Function: \[RSS=\Sigma^n_{i=1}\left(y_i-\beta_0-\Sigma^p_{j=1}\beta_jx_{ij}\right)^2\]
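The shrinkage fits below modify this cost with a penalty term, matching the alpha argument of glmnet (alpha = 0 for ridge, alpha = 1 for the lasso):

\[RSS+\lambda\Sigma^p_{j=1}\beta_j^2 \text{ (ridge)}\qquad RSS+\lambda\Sigma^p_{j=1}|\beta_j| \text{ (lasso)}\]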
library(glmnet)
library(ISLR)
x = model.matrix(Salary ~ .-1, data = Hitters)
y = Hitters$Salary
y = na.omit(y)
lam2 =10^seq (10,-2, length =100)
ridge.fit = glmnet (x,y,alpha =0,lambda=lam2)
# if lambda is omitted, glmnet chooses a lambda sequence automatically
# alpha = 0 ridge reg , alpha = 1 lasso
dim(coef (ridge.fit ))
## [1] 21 100
# with 21 rows (one for each predictor, plus an intercept) and 100 columns (one for each value of λ).
coef( ridge.fit)[ ,1]
## (Intercept) AtBat Hits HmRun Runs
## 5.359257e+02 5.443467e-08 1.974589e-07 7.956523e-07 3.339178e-07
## RBI Walks Years CAtBat CHits
## 3.527222e-07 4.151323e-07 1.697711e-06 4.673743e-09 1.720071e-08
## CHmRun CRuns CRBI CWalks LeagueA
## 1.297171e-07 3.450846e-08 3.561348e-08 3.767877e-08 5.800263e-07
## LeagueN DivisionW PutOuts Assists Errors
## -5.800262e-07 -7.807263e-06 2.180288e-08 3.561198e-09 -1.660460e-08
## NewLeagueN
## -1.152288e-07
# first column: largest penalty, coefficients close to 0
# last column: smallest penalty, coefficients close to the lm() estimates
plot(ridge.fit, xvar="lambda", main='model with original x')
# the larger the penalty, the closer the coefficients shrink toward 0
ridge.fit2 = glmnet (scale(x),y,alpha =0,lambda=lam2)
plot(ridge.fit2,xvar="lambda",main='model with scale x' )
# xvar = c("norm", "lambda", "dev"),
# find best lambda
set.seed (1)
cv.ridge =cv.glmnet (x,y,alpha =0)
#plot(cv.ridge)
bestlam =cv.ridge$lambda.min
# lambda.min => return the lambda with lowest cv error
#ridge.fit = glmnet (x,y,alpha =0,lambda=lam2)
ridge.pred=predict (ridge.fit ,s=bestlam ,newx=x)
# plot(ridge.fit2,xvar="lambda",main='model with best lambda' )
# abline(v=log(bestlam),col='red')
MSE.ridge = mean(( ridge.pred -y)^2)
lasso.fit = glmnet (scale(x),y,alpha =1 )
plot(lasso.fit)
plot(lasso.fit, xvar='lambda', main='log(lambda) = 1')
abline(v=1, col='red')
# v = 1 on the log(lambda) axis corresponds to lambda = exp(1)
# predict(lasso.fit, s=exp(1), type="coefficients")
#   inspects the coefficients at that lambda
# find lambda
cv.lasso = cv.glmnet(x,y, alpha=1)
#plot(cv.lasso)
lam_lasso = cv.lasso$lambda.min # best lam
lam_lasso
## [1] 1.270545
# fit model with the best lambda
# lasso.fit = glmnet (scale(x),y,alpha =1 )
plot(lasso.fit,xvar='lambda', main = 'with best lambda')
abline(v=log(lam_lasso),col='orange')
lasso.pred = predict(lasso.fit, s = lam_lasso, type = "coefficients")
# label the variable
lbs_fun <- function(fit, ...) {
L <- length(fit$lambda)
x <- log(fit$lambda[L])
y <- fit$beta[, L]
labs <- names(y)
text(x, y, labels=labs, ...)
}
lbs_fun(lasso.fit)
#display the coef with select lambda
coef(lasso.fit,s=lam_lasso)
## 21 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 5.359259e+02
## AtBat -2.775356e+02
## Hits 2.962458e+02
## HmRun 7.811240e+00
## Runs -1.549288e+01
## RBI .
## Walks 1.181660e+02
## Years -4.101294e+01
## CAtBat -1.037636e+02
## CHits .
## CHmRun 2.846453e+01
## CRuns 3.319488e+02
## CRBI 1.652094e+02
## CWalks -1.835379e+02
## LeagueA -2.124819e+01
## LeagueN 3.334554e-11
## DivisionW -5.879292e+01
## PutOuts 7.872171e+01
## Assists 3.886276e+01
## Errors -1.797120e+01
## NewLeagueN -3.320620e+00
\[y_i=\beta_0+\beta_1x_i+\beta_2x_i^2+\beta_3x_i^3+...+\beta_dx^d_i+\epsilon_i\]
Wage=ISLR::Wage
fit_polyreg = lm(wage ~ poly(age, 4), data=Wage)
coef(summary(fit_polyreg))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.70361 0.7287409 153.283015 0.000000e+00
## poly(age, 4)1 447.06785 39.9147851 11.200558 1.484604e-28
## poly(age, 4)2 -478.31581 39.9147851 -11.983424 2.355831e-32
## poly(age, 4)3 125.52169 39.9147851 3.144742 1.678622e-03
## poly(age, 4)4 -77.91118 39.9147851 -1.951938 5.103865e-02
# fit2a = lm(wage ~ age + I(age^2) + I(age^3) + I(age^4), data=Wage)
# fit2b = lm(wage ~ cbind(age, age^2, age^3, age^4), data=Wage)
# we can use ANOVA to compare the model
# > fit.1 = lm(wage ~ age, data=Wage)
# > fit.2 = lm(wage ~ poly(age, 2), data=Wage)
# > fit.3 = lm(wage ~ poly(age, 3), data=Wage)
# > fit.4 = lm(wage ~ poly(age, 4), data=Wage)
# > fit.5 = lm(wage ~ poly(age, 5), data=Wage)
# > anova(fit.1, fit.2, fit.3, fit.4, fit.5)
# Analysis of Variance Table
# Model 1: wage ~ age
# Model 2: wage ~ poly(age, 2)
# Model 3: wage ~ poly(age, 3)
# Model 4: wage ~ poly(age, 4)
# Model 5: wage ~ poly(age, 5)
# Res.Df RSS Df Sum of Sq F Pr(>F)
# 1 2998 5022216
# 2 2997 4793430 1 228786 143.59 <2e-16 ***
# 3 2996 4777674 1 15756 9.89 0.0017 **
# 4 2995 4771604 1 6070 3.81 0.0510 .
# 5 2994 4770322 1 1283 0.80 0.3697
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# The p-value comparing the linear Model 1 to the quadratic Model 2 is essentially zero (< 10^-15), indicating that a linear fit is not sufficient.
# Similarly the p-value comparing the quadratic Model 2 to the cubic Model 3 is very low (0.0017), so the quadratic fit is also insufficient.
# The p-value comparing the cubic and degree-4 polynomials, Model 3 and Model 4, is approximately 5% while the degree-5 polynomial Model 5 seems unnecessary because its p-value is 0.37.
\[C_1(X) = I(X<c_1)=I(X < 35)\\ C_2(X)=I(c_1\leq X<c_2) = I(35 \leq X < 50)\\ . . . ,\\ C_K(X) = I(X \geq c_K)\] \[y_i=\beta_0+\beta_1C_1(x_i)+\beta_2C_2(x_i)+\beta_3C_3(x_i)+...+\beta_KC_K(x_i)+\epsilon_i\]
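A minimal sketch of fitting this step function in R with cut() (the breakpoints here are illustrative, not from the notes):

```r
library(ISLR)
# cut() builds the indicator basis C_1(X), ..., C_K(X) from the chosen breakpoints
step_fit <- lm(wage ~ cut(age, breaks = c(17, 35, 50, 65, 81)), data = Wage)
coef(summary(step_fit))
```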
Their “pieces” may be all linear, or a combination of functional forms (such as constant, linear, quadratic, cubic, square root, cube root, exponential, etc).
with one knot at age = 50
Third order Piecewise Polynomials add constraints
library(ISLR)
library(splines)
agelims=range(Wage$age)
age.grid=seq(from=agelims[1],to=agelims[2])
# Natural spline (ns)
ns_fit <- lm(wage ~ ns(age, knots=c(20,40,60)), data=Wage)
pred.ns = predict(ns_fit, list(age=age.grid), se=T)
# Cubic spline (B-spline basis, bs)
bs_fit <- lm(wage ~ bs(age, knots=c(20,40,60)), data=Wage)
pred.bs = predict(bs_fit, list(age=age.grid), se=T)
plot(Wage$age, Wage$wage, pch=19, col='lightgrey',cex=0.5)
lines(age.grid,pred.ns$fit, col="#3690C0",lwd=4)
lines(age.grid,pred.bs$fit, col="darkturquoise",lwd=1)
legend("topright",c("Cubic Splines","Natural Spline"),col=c("darkturquoise","#3690C0"),lwd=c(1,4))
# natural spline has better behavior at the boundary points
This is also explained in the textbook.
\[\min_{g \in S} \Sigma^n_{i=1}(y_i-g(x_i))^2+\lambda \int g''(t)^2 dt\]
The algorithmic details are too complex to describe here. In R, the function smooth.spline() will fit a smoothing spline.
The effective degrees of freedom are given by\(df_{\lambda}=\Sigma^n_{i=1}\{S_{\lambda}\}_{ii}\)
Wage=ISLR::Wage
ss_fit1 <- smooth.spline(Wage$age, Wage$wage, lambda = 0.5)
ss_fit2 <- smooth.spline(Wage$age, Wage$wage, lambda = 5)
ss_fit3 <- smooth.spline(Wage$age, Wage$wage, lambda = 500)
# note: when lambda is supplied directly, df and cv are ignored
# to choose the penalty by CV instead:
# step 1: split the data
# step 2: pick lambda from a grid, e.g. [0, 10]
# step 3: train the model on x_train, evaluate on x_test
# step 4: keep the lambda with the lowest test error
plot(Wage$age, Wage$wage, pch=19, col='grey')
lines(ss_fit1, col="orangered", lwd=1)
lines(ss_fit2, col="blue", lwd=1)
lines(ss_fit3, col="goldenrod1", lwd=1)
# larger lambda (smaller effective df) = smoother fit
legend('topright', col=c('orangered','blue','goldenrod1'), c('lambda = 0.5','lambda = 5','lambda = 500'), lty=1)
Algorithm: local regression at \(X=x_0\)
Wage=ISLR::Wage
loreg.fit1=loess(wage~age,data=Wage, span=0.1)
pred.lo1 = predict(loreg.fit1)
loreg.fit2=loess(wage~age,data=Wage, span=0.5)
pred.lo2 = predict(loreg.fit2)
loreg.fit3=loess(wage~age,data=Wage, span=1)
pred.lo3 = predict(loreg.fit3)
par(mfrow=c(1,3))
plot(Wage$age, Wage$wage, pch=19, col='grey')
lines(Wage$age,pred.lo1, col="darkturquoise",lwd=1)
plot(Wage$age, Wage$wage, pch=19, col='grey')
lines(Wage$age,pred.lo2, col="lightpink",lwd=1)
plot(Wage$age, Wage$wage, pch=19, col='grey')
lines(Wage$age,pred.lo3, col="lightgoldenrodyellow",lwd=1)
# larger span = smoother fit
\[y_i=\beta_0+f_1(x_{i1})+f_2(x_{i2})+...+f_p(x_{ip})+\epsilon_i\]
estimates: \(\hat{ \beta_0},\hat{f_1},\hat{f_2}...,\hat{f_p}\)
can use smoothing splines or local regression as well
Wage=ISLR::Wage
attach(Wage)
library(splines)
library(gam)
par(mfrow=c(1,3))
### regression model
# Method 1: a GAM can be fit simply with lm(), e.g. using natural splines
fit.lm.wage=lm(wage~ns(year, df = 5) + ns(age, df = 5) + education)
summary(fit.lm.wage)
##
## Call:
## lm(formula = wage ~ ns(year, df = 5) + ns(age, df = 5) + education)
##
## Residuals:
## Min 1Q Median 3Q Max
## -120.646 -19.715 -3.452 14.293 214.068
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.728 4.709 9.922 < 2e-16 ***
## ns(year, df = 5)1 2.109 3.178 0.663 0.50708
## ns(year, df = 5)2 10.415 3.787 2.750 0.00599 **
## ns(year, df = 5)3 2.292 3.400 0.674 0.50039
## ns(year, df = 5)4 9.585 4.051 2.366 0.01803 *
## ns(year, df = 5)5 6.080 2.420 2.513 0.01203 *
## ns(age, df = 5)1 45.053 4.195 10.739 < 2e-16 ***
## ns(age, df = 5)2 38.476 5.076 7.580 4.59e-14 ***
## ns(age, df = 5)3 34.041 4.387 7.759 1.16e-14 ***
## ns(age, df = 5)4 48.741 10.572 4.610 4.19e-06 ***
## ns(age, df = 5)5 6.737 8.369 0.805 0.42091
## education2. HS Grad 11.038 2.431 4.541 5.83e-06 ***
## education3. Some College 23.510 2.562 9.176 < 2e-16 ***
## education4. College Grad 38.359 2.548 15.057 < 2e-16 ***
## education5. Advanced Degree 62.601 2.762 22.667 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.16 on 2985 degrees of freedom
## Multiple R-squared: 0.2933, Adjusted R-squared: 0.2899
## F-statistic: 88.47 on 14 and 2985 DF, p-value: < 2.2e-16
plot.Gam(fit.lm.wage,se=T,col='mediumvioletred')
# Method 2: can use smoothing splines or local regression as well:
fit.gam.wage=gam(wage ~ s(year, df = 5) + lo(age, span = .5) + education)
#summary(fit.gam.wage)
plot(fit.gam.wage,se=T,col='mediumvioletred')
# x-variable, y-response
# For the variable 'year', wages tend to increase, with an apparent dip around 2007 or 2008. For the categorical variable 'education', wage also increases monotonically with education level.
# It makes no difference whether we use gam() or lm() to fit this additive model.
### logistic Regression Model
# the indicator function I() converts the response to a binary variable
fit.logit.gam = gam(I(wage > 250) ~ s(year, df = 5)+s(age, df = 4) + education, family = binomial,data=Wage)
plot(fit.logit.gam,se=T,col='mediumvioletred')
# the model estimates P(wage > 250 | X) vs P(wage <= 250 | X)
# the plots match the first model's, except the y-axis is now the logit of the probability
# in the plot for the 'year' variable, the error bands are quite wide,
# which may indicate the nonlinear function fitted for 'year' is not significant
fit.logit.gam2 = gam(I(wage > 250) ~ year+s(age, df = 4) + education, family = binomial,data=Wage)
summary(fit.logit.gam2)
##
## Call: gam(formula = I(wage > 250) ~ year + s(age, df = 4) + education,
## family = binomial, data = Wage)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.56245 -0.27321 -0.12192 -0.08593 3.24298
##
## (Dispersion Parameter for binomial family taken to be 1)
##
## Null Deviance: 730.5345 on 2999 degrees of freedom
## Residual Deviance: 603.7774 on 2990 degrees of freedom
## AIC: 623.7775
##
## Number of Local Scoring Iterations: 16
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## year 1 0.46 0.4581 0.5671 0.45147
## s(age, df = 4) 1 4.20 4.2028 5.2023 0.02263 *
## education 4 66.32 16.5794 20.5224 < 2e-16 ***
## Residuals 2990 2415.52 0.8079
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar Chisq P(Chi)
## (Intercept)
## year
## s(age, df = 4) 3 10.001 0.01856 *
## education
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit.logit.gam,fit.logit.gam2,test='Chisq')
## Analysis of Deviance Table
##
## Model 1: I(wage > 250) ~ s(year, df = 5) + s(age, df = 4) + education
## Model 2: I(wage > 250) ~ year + s(age, df = 4) + education
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 2986 602.09
## 2 2990 603.78 -4 -1.6906 0.7924
# The p-value (0.7924) suggests we fail to reject the null hypothesis: model 1 is not significantly different from model 2.
# A very small p-value (< .001) would instead mean the change in model 2 led to a significantly improved fit over model 1.
Decision trees can be applied to both regression and classification problems.
Select an optimal value \(\alpha\) by CV
Apply the selected \(\alpha\) back to the full tree to obtain the final tree. Summary of the tree algorithm:
return the subtree for the chosen \(\alpha\)
process:
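A hedged sketch of this process with the rpart package (here cp plays the role of \(\alpha\); the data choice is illustrative):

```r
library(rpart)
library(ISLR)
# grow a full regression tree; rpart runs 10-fold CV over its cp (alpha) grid internally
tree_fit <- rpart(mpg ~ . - name, data = Auto, method = "anova")
printcp(tree_fit)  # CV error (xerror) for each cp value
best_cp <- tree_fit$cptable[which.min(tree_fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree_fit, cp = best_cp)  # final subtree for the chosen cp
```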
Trees Versus Linear Models
PCA produces a low-dimensional representation of a dataset. It finds a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated. Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization.
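A minimal PCA sketch (not in the notes) on R's built-in USArrests data:

```r
pr_out <- prcomp(USArrests, scale. = TRUE)  # center and scale before PCA
pr_out$rotation   # loadings: the maximal-variance, mutually uncorrelated combinations
summary(pr_out)   # proportion of variance explained by each PC
biplot(pr_out, scale = 0)  # visualize the data on the first two PCs
```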
PCA vs Clustering